Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is code-along style: it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus and also distributed via email. The teaching materials will consist of a JupyterLab notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and, when required, datasets to import into Python. This learning approach will let you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark).
We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to this fourth lecture in a series of seven. Today we will pick up where we left off last week with our merged data. We'll learn how to explore the data, summarize it, and plot it!
At the end of this lecture we will aim to have covered the following topics:
The matplotlib.pyplot package
The seaborn package
Legend of the formatting used in these notes:
grey background - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's datasets will focus on using Python lists and the NumPy package
This is a copy of our formatted and merged data from Lecture 03. It's a smaller subset of the final merged dataset, for doing some simple exploratory data analysis towards the end of class. The original data was obtained from phase 1 of the Human Microbiome Project (HMP). The HMP was a joint effort led by the US National Institutes of Health to characterize the microbiome of healthy human subjects at five major body sites using metagenomics (a.k.a. environmental DNA) and 16S ribosomal RNA gene amplicon sequencing. The dataset that we will use contains the results of 16S rRNA gene sequencing, a technique in which this gene is used as a marker for taxonomic identification and classification.
Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene
16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20 cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs): groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, and organic matter composition were also recorded.
We have 3 csv files.
A metadata file (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.
An OTU abundance table containing the numbers of operational taxonomic units sampled from various sites.
B Torondel, JHJ Ensink, O Gunvirusdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. DOI: 10.1111/1751-7915.12334
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.
random is a package with methods to add pseudorandomness to programs
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
os is a package for interacting with the operating system (e.g. file paths and directories)
pandas provides the DataFrame class and tools for data manipulation and analysis
matplotlib is a plotting library; we'll import its pyplot module for making figures
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
Recall from last week, we spent our lecture converting data from the human microbiome project from wide-format to long-format. Now that the wrangling is completed, we can perform some exploratory data analysis (EDA). The process of EDA investigates your data to identify abnormalities, summarize its main characteristics and identify potential patterns or trends for further validation.
With our EDA today we will try to answer questions like:
Let's open a version of our dataset from last week. We'll find it in the file data/subset_data_taxa_merged.csv. Recall that this is a representative subset of the entire dataset. It is always good practice to explore your data to learn more about aspects such as probability distributions, outliers, and measures of central tendency and dispersion.
Now we are going to do some exploratory data analysis (EDA) on subset_data_taxa_merged.csv, a subset of the merged dataset (0.001%).
# Read in subset_data_taxa_merged.csv
merged_subset = pd.read_csv("data/subset_data_taxa_merged.csv")
# Empty cells are filled with NaN by default
# Take a peek at the data
merged_subset.head()
# How big is this dataset?
merged_subset.info()
| OTU | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_10788 | 700016046 | 158964549 | 1 | Male | BCM,WUGC | Skin | Left Retroauricular Crease | SRS011434 | 0 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus |
| 1 | OTU_97_10020 | 700015454 | 159025240 | 1 | Female | JCVI,BI | Airways | Anterior Nares | NaN | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas |
| 2 | OTU_97_10234 | 700015454 | 159025240 | 1 | Female | JCVI,BI | Airways | Anterior Nares | NaN | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | NaN | NaN |
| 3 | OTU_97_10174 | 700015738 | 158984779 | 1 | Female | JCVI,BI | Oral | Supragingival Plaque | NaN | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | NaN | NaN |
| 4 | OTU_97_10895 | 700024130 | 764245047 | 1 | Female | WUGC | Skin | Right Retroauricular Crease | SRS015496 | 0 | Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Selenomonas |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 249750 entries, 0 to 249749 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OTU 249750 non-null object 1 PSN 249750 non-null int64 2 RSID 249750 non-null int64 3 VISITNO 249750 non-null int64 4 SEX 249750 non-null object 5 RUN_CENTER 249750 non-null object 6 HMP_BODY_SITE 249750 non-null object 7 HMP_BODY_SUBSITE 249750 non-null object 8 SRS_SAMPLE_ID 207200 non-null object 9 count 249750 non-null int64 10 SUPERKINGDOM 249750 non-null object 11 PHYLUM 247718 non-null object 12 CLASS 246463 non-null object 13 ORDER 243965 non-null object 14 FAMILY 223101 non-null object 15 GENUS 200238 non-null object dtypes: int64(4), object(12) memory usage: 30.5+ MB
Let's begin with a deceptively simple question about our data. As we'll see, however, it requires more than a single method call to answer. We'll walk through the thought process so you can avoid potential pitfalls later in your own analyses.
So we've imported a dataset of 249,750 rows across 16 columns. You'll note a new column named "OTU" that we didn't have in our prior dataset. Previously we had left this as an index when we were completing our merges, but now it exists as its own column. Recall that at the time of importing we could have used the index_col parameter to bring OTU in as the index, as we had it before.
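As a sketch of the index_col parameter (using a tiny hypothetical CSV in memory rather than our actual file):

```python
import io
import pandas as pd

# A tiny stand-in CSV (hypothetical values) to demonstrate index_col
csv_text = "OTU,count\nOTU_97_10788,0\nOTU_97_10020,3\n"

# index_col moves the named column into the index at import time, e.g.
# pd.read_csv("data/subset_data_taxa_merged.csv", index_col="OTU")
df = pd.read_csv(io.StringIO(csv_text), index_col="OTU")
df.index.name   # 'OTU'
```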
Time to pull up a summary of the numeric information. Remember that we can use the describe() method to accomplish this.
# Get a description of the numeric data
merged_subset.describe()
| PSN | RSID | VISITNO | count | |
|---|---|---|---|---|
| count | 2.497500e+05 | 2.497500e+05 | 249750.000000 | 249750.000000 |
| mean | 7.000227e+08 | 5.275325e+08 | 1.028144 | 0.114759 |
| std | 6.252151e+03 | 2.953423e+08 | 0.165385 | 7.156969 |
| min | 7.000135e+08 | 1.580137e+08 | 1.000000 | 0.000000 |
| 25% | 7.000161e+08 | 1.590859e+08 | 1.000000 | 0.000000 |
| 50% | 7.000236e+08 | 7.637595e+08 | 1.000000 | 0.000000 |
| 75% | 7.000246e+08 | 7.642855e+08 | 1.000000 | 0.000000 |
| max | 7.000372e+08 | 7.651959e+08 | 2.000000 | 3131.000000 |
The describe() method to summarize non-numeric columns

Recall that we can also summarize our non-numeric data, to a certain extent, as long as we provide it properly to the describe() method. This method can identify the "top" occurring entry in a column as well as its frequency. We already know that the OTU column contains each occurring OTU, so it should be simple enough to create a summary of that information. Let's give it a try.
# Use the describe method on the OTU column from our merged_subset
merged_subset.OTU.describe()
count 249750 unique 999 top OTU_97_10362 freq 298 Name: OTU, dtype: object
Last lecture we saw a few examples where we subset our data using a simple conditional statement like isna(). While we referred to this as "slicing" our data, you can also think of it as a way to filter data. We can, of course, filter our data using other conditional statements, and before summarizing your data you should consider the nature of your values! In our original merged data, each OTU appeared 1000 times.
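As a refresher on the isna() case, here is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with a missing GENUS
df = pd.DataFrame({"GENUS": ["Rothia", np.nan], "count": [1, 2]})

# isna() broadcasts to a boolean mask; loc uses it to keep matching rows
missing = df.loc[df["GENUS"].isna()]
```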
Recall that our original dataset started off with 1000 observations and ~2000 OTU columns. When we melted the data, this generated quite a large ~2M-row dataset before merging with our taxa metadata, which covered ~1000 OTUs. By virtue of this process, each OTU could appear 1000 times in our data BUT it may not have actually occurred within each sample!
So according to our quick describe, we see that OTU_97_10362 appears most often in our dataset, but is that the answer to our question? Looking at the total count, there are 249750 non-NA values and 999 unique values, or an average of 250 entries per OTU! Are we sure we summarized this correctly?
To solve this conundrum we turn to filtering our data by the count column, which measures the number of occurrences of a specific OTU in a given sample. We'll do this in two steps:
# Recall that we can broadcast a conditional query to multiple values in a DataFrame
merged_subset.iloc[:,9] > 0
# Supply the result of the conditional query as a filtering criteria
merged_subset.loc[merged_subset.iloc[:,9] > 0]
0 False
1 False
2 False
3 False
4 False
...
249745 False
249746 False
249747 False
249748 False
249749 True
Name: count, Length: 249750, dtype: bool
| OTU | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | OTU_97_10321 | 700021312 | 763536994 | 1 | Male | WUGC | Oral | Tongue Dorsum | SRS014271 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces |
| 62 | OTU_97_10400 | 700024552 | 764305738 | 1 | Male | WUGC | Oral | Palatine Tonsils | SRS015925 | 5 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus |
| 104 | OTU_97_10165 | 700037036 | 765094712 | 1 | Male | WUGC | Oral | Buccal Mucosa | SRS018595 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Aerococcaceae | Abiotrophia |
| 261 | OTU_97_10813 | 700015245 | 158802708 | 1 | Male | JCVI,WUGC | Gastrointestinal Tract | Stool | SRS011271 | 45 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
| 294 | OTU_97_1069 | 700016909 | 159389382 | 1 | Male | BCM | Oral | Hard Palate | SRS013776 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 249436 | OTU_97_10678 | 700014515 | 158418336 | 1 | Male | BCM,BI | Oral | Tongue Dorsum | NaN | 1 | Bacteria | Fusobacteria | Fusobacteria | Fusobacteriales | Fusobacteriaceae | Fusobacterium |
| 249532 | OTU_97_10579 | 700015290 | 158802708 | 1 | Male | JCVI,WUGC | Skin | Right Retroauricular Crease | SRS011292 | 2 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium |
| 249547 | OTU_97_1084 | 700024908 | 764669880 | 1 | Male | WUGC | Gastrointestinal Tract | Stool | SRS016267 | 6 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | NaN |
| 249727 | OTU_97_1088 | 700024032 | 764042746 | 1 | Female | WUGC | Oral | Throat | SRS015405 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Micrococcaceae | Rothia |
| 249749 | OTU_97_10486 | 700023866 | 764143897 | 1 | Female | WUGC | Oral | Buccal Mucosa | SRS015235 | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Moryella |
4383 rows × 16 columns
# Now you can summarize information from the OTU column
merged_subset.loc[merged_subset.iloc[:,9] > 0].OTU.describe()
count 4383 unique 715 top OTU_97_10 freq 79 Name: OTU, dtype: object
We've now come to the simple answer showing that OTU_97_10 appears most frequently with a count of > 0 within our data. What is the specific information about this OTU? We can filter for this information too!
Since the output of our summary is also an indexed object, we can pull out the top value as an attribute.
# Now you can summarize information from the OTU column
top_OTU = merged_subset.loc[merged_subset.iloc[:,9] > 0]["OTU"].describe().top
merged_subset.loc[merged_subset["OTU"] == top_OTU].iloc[0]
OTU OTU_97_10 PSN 700023924 RSID 763982056 VISITNO 1 SEX Male RUN_CENTER WUGC HMP_BODY_SITE Airways HMP_BODY_SUBSITE Anterior Nares SRS_SAMPLE_ID SRS015291 count 0 SUPERKINGDOM Bacteria PHYLUM Firmicutes CLASS Clostridia ORDER Clostridiales FAMILY Veillonellaceae GENUS Veillonella Name: 1388, dtype: object
The filter() method

You may be wondering to yourself: surely there must be a filter() method implemented for the DataFrame? It seems like such an essential part of working with DataFrames. You're right that such a method exists, but you would be wrong to think that it is used for conditional filtering.
The filter() method is used merely for subsetting your DataFrame by column name or by a regular expression pattern (see Lecture 06). While this can be helpful in certain contexts, it does NOT implement the idea of filtering rows of our data based on conditional criteria.
Here's a quick example of how to use it.
# Select the OTU, HMP_BODY_SITE, and count columns
merged_subset.filter(items = ['OTU', 'HMP_BODY_SITE', 'count'])
| OTU | HMP_BODY_SITE | count | |
|---|---|---|---|
| 0 | OTU_97_10788 | Skin | 0 |
| 1 | OTU_97_10020 | Airways | 0 |
| 2 | OTU_97_10234 | Airways | 0 |
| 3 | OTU_97_10174 | Oral | 0 |
| 4 | OTU_97_10895 | Skin | 0 |
| ... | ... | ... | ... |
| 249745 | OTU_97_10840 | Skin | 0 |
| 249746 | OTU_97_10146 | Oral | 0 |
| 249747 | OTU_97_10054 | Oral | 0 |
| 249748 | OTU_97_10121 | Oral | 0 |
| 249749 | OTU_97_10486 | Oral | 1 |
249750 rows × 3 columns
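To sketch the regular-expression flavour of filter() mentioned above, using a made-up one-row stand-in for merged_subset:

```python
import pandas as pd

# Hypothetical mini-frame standing in for merged_subset
df = pd.DataFrame({"OTU": ["OTU_97_10788"],
                   "HMP_BODY_SITE": ["Skin"],
                   "HMP_BODY_SUBSITE": ["Left Retroauricular Crease"],
                   "count": [0]})

# regex= keeps only the columns whose names match the pattern
body_cols = df.filter(regex="BODY")
list(body_cols.columns)   # ['HMP_BODY_SITE', 'HMP_BODY_SUBSITE']
```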
While we have discovered which specific OTU is most frequently encountered amongst all the samples, it may be occurring commonly but in very low amounts. Furthermore, we may be missing part of the big picture, as many OTUs may really come from the same genus. This kind of information may be more helpful for understanding our microbiome data, so how do we explore it?
The groupby() method

Thinking about our problem, we have a GENUS column and we can take advantage of it by exploring the data based on the unique values in GENUS. We could try to filter our data using a different conditional, GENUS == "Veillonella", but we would have to somehow cycle through each potential value and summarize each subset. (Honestly, this would have been my approach not so long ago!)
Luckily for you, the groupby() method can sort all of that data for you based on the criteria you provide. The important parameters to use today are:
by: A function, label, or list of labels that you want to use to determine the grouping criteria. You can even provide a dictionary object where specific key:value pairs determine the groupings.
axis: How to split: along the rows (0, the default) or the columns (1).
as_index: A boolean determining whether the index labels should be based on the group labels (True by default).
# Recall from last week that there were only 85 unique genera in our dataset
len(merged_subset['GENUS'].unique())
# Group the subset data by 'GENUS'
merged_subset.groupby(by = ['GENUS'])
85
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D9F10EF7F0>
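Of the parameters listed above, as_index changes the shape of summaries computed from the groups. A minimal sketch with toy values (not our real data):

```python
import pandas as pd

# Toy stand-in data (hypothetical) for demonstrating as_index
df = pd.DataFrame({"GENUS": ["A", "A", "B"], "count": [1, 2, 3]})

# Default (as_index=True): the group labels become the index of the result
by_index = df.groupby("GENUS")["count"].sum()

# as_index=False: GENUS stays as an ordinary column in a DataFrame
as_cols = df.groupby("GENUS", as_index=False)["count"].sum()
```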
The head() method to view rows from each group

As you can see above, we created a grouped DataFrame, but if we attempted to look at it, it would look pretty much like the original merged_subset. The major difference is that the data has now essentially been sorted by the GENUS column. In order to view part of it, we can use the head(n) method, which will return n rows from each group.
# Group the subset data by 'GENUS' and view 1 row from each group
merged_subset.groupby(by = ['GENUS']).head(1)
| OTU | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_10788 | 700016046 | 158964549 | 1 | Male | BCM,WUGC | Skin | Left Retroauricular Crease | SRS011434 | 0 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus |
| 1 | OTU_97_10020 | 700015454 | 159025240 | 1 | Female | JCVI,BI | Airways | Anterior Nares | NaN | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas |
| 2 | OTU_97_10234 | 700015454 | 159025240 | 1 | Female | JCVI,BI | Airways | Anterior Nares | NaN | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | NaN | NaN |
| 4 | OTU_97_10895 | 700024130 | 764245047 | 1 | Female | WUGC | Skin | Right Retroauricular Crease | SRS015496 | 0 | Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Selenomonas |
| 5 | OTU_97_10704 | 700024402 | 764224817 | 1 | Male | WUGC | Oral | Palatine Tonsils | SRS015772 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2598 | OTU_97_10336 | 700037132 | 765013792 | 1 | Female | WUGC | Oral | Supragingival Plaque | SRS018751 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Intrasporangiaceae | Terracoccus |
| 3219 | OTU_97_10717 | 700023494 | 764002286 | 1 | Male | WUGC | Oral | Saliva | SRS014857 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Odoribacter |
| 3669 | OTU_97_10426 | 700023242 | 763840445 | 1 | Female | WUGC | Airways | Anterior Nares | SRS014623 | 0 | Bacteria | Firmicutes | Clostridia | Clostridiales | Clostridiales Family XI. Incertae Sedis | Finegoldia |
| 4384 | OTU_97_10229 | 700023241 | 763840445 | 1 | Female | WUGC | Skin | Right Antecubital Fossa | SRS014621 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Microbacteriaceae | Pseudoclavibacter |
| 8536 | OTU_97_10826 | 700024689 | 764508039 | 1 | Male | WUGC | Gastrointestinal Tract | Stool | SRS016056 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Micrococcaceae | Kocuria |
85 rows × 16 columns
As expected, we see 85 rows in our output, meaning there are 85 groups in our grouped DataFrame! Now that we have our groups, we can begin to apply functions to summarize data from each group. Back to our question: we can now ask which genus produces the largest number of total counts.
We have already encountered the sum() and max() functions to some degree so it is simply a matter of applying them to the correct subset of data. Recall that our grouped data will be indexed by genus so we can subset our count column and then apply functions to it. Let's see what that looks like step-by-step.
# Subset the "count" column from our grouped data
merged_subset.groupby(by = ['GENUS'])["count"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001D9F4EEB670>
# Subset the "count" column from our grouped data
merged_subset.groupby(by = ['GENUS'])["count"].sum()
GENUS
Abiotrophia 87
Acetivibrio 14
Acinetobacter 1
Actinobacillus 3
Actinomyces 1192
...
Treponema 186
Ureaplasma 317
Veillonella 1736
Xanthomonas 92
Zoogloea 0
Name: count, Length: 84, dtype: int64
# Subset the "count" column from our grouped data
merged_subset.groupby(by = ['GENUS'])["count"].sum().max()
4164
sort_values() to organize your data

Not exactly the result we were looking for. The max() function did reveal the largest value, but we lost the index information at the same time, leaving us without a complete answer. Instead, we can sort the resulting summed group data in descending order with the sort_values() method. To sort in descending (reverse) order, we set the ascending = False parameter.
# Subset the "count" column from our grouped data
merged_subset.groupby(by = ['GENUS'])["count"].sum().sort_values(ascending = False)
GENUS
Atopobium 4164
Streptococcus 2433
Prevotella 2072
Veillonella 1736
Capnocytophaga 1469
...
Brevibacterium 0
Bifidobacterium 0
Bergeyella 0
Actinomycetospora 0
Zoogloea 0
Name: count, Length: 84, dtype: int64
We can now answer our original question for this section and say that the Atopobium genus has the highest total number of counts in our dataset, with a total of 4164.
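An alternative to sorting, sketched here with a toy Series built from a few of the totals above: the idxmax() method returns the index label of the maximum directly, so the genus name is not lost the way it was with max().

```python
import pandas as pd

# Toy Series using a few of the summed totals from our output
totals = pd.Series({"Atopobium": 4164, "Streptococcus": 2433, "Prevotella": 2072})

totals.max()      # 4164 -> the value alone, index label lost
totals.idxmax()   # 'Atopobium' -> the label of the largest value
```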
# How big is the Atopobium subgroup?
merged_subset[merged_subset['GENUS'] == ...]...
# What is the mean count for the Atopobium subgroup?
merged_subset[merged_subset['GENUS'] == ...]...
Now that we've got a few very helpful tools under our belt, we can ask a similar question about the diversity of genera within each body subsite. Let's begin by reviewing the basic information about our dataset.
# Get the basic information about our dataset
merged_subset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 249750 entries, 0 to 249749 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OTU 249750 non-null object 1 PSN 249750 non-null int64 2 RSID 249750 non-null int64 3 VISITNO 249750 non-null int64 4 SEX 249750 non-null object 5 RUN_CENTER 249750 non-null object 6 HMP_BODY_SITE 249750 non-null object 7 HMP_BODY_SUBSITE 249750 non-null object 8 SRS_SAMPLE_ID 207200 non-null object 9 count 249750 non-null int64 10 SUPERKINGDOM 249750 non-null object 11 PHYLUM 247718 non-null object 12 CLASS 246463 non-null object 13 ORDER 243965 non-null object 14 FAMILY 223101 non-null object 15 GENUS 200238 non-null object dtypes: int64(4), object(12) memory usage: 30.5+ MB
# Retrieve information about the HMP_BODY_SITE
merged_subset["HMP_BODY_SUBSITE"].unique()
len(merged_subset["HMP_BODY_SUBSITE"].unique())
array(['Left Retroauricular Crease', 'Anterior Nares',
'Supragingival Plaque', 'Right Retroauricular Crease',
'Palatine Tonsils', 'Attached Keratinized Gingiva', 'Saliva',
'Subgingival Plaque', 'Right Antecubital Fossa',
'Left Antecubital Fossa', 'Tongue Dorsum', 'Hard Palate',
'Buccal Mucosa', 'Throat', 'Stool', 'Posterior Fornix',
'Vaginal Introitus', 'Mid Vagina'], dtype=object)
18
Time to think about your dataset in relation to your question. We know already that each genus may appear within any body subsite, but its count may be 0. Again, it will be important to filter our data before summarizing it. Then you must understand what you are looking for and how it is represented in your data.
Our plan will be to: (1) filter the data by a minimum count, (2) group the data by HMP_BODY_SUBSITE, and (3) count the unique genera within each group. Steps 1 and 2 should be straightforward. We'll use the first() method to look at the first entry of each group.
# Filter our data by minimum count number and then group it by body subsite
merged_subset[merged_subset["count"]>0].groupby('HMP_BODY_SUBSITE', dropna=True).first()
| OTU | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | SRS_SAMPLE_ID | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HMP_BODY_SUBSITE | |||||||||||||||
| Anterior Nares | OTU_97_10756 | 700016313 | 159085930 | 1 | Female | JCVI,WUGC | Airways | SRS011502 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium |
| Attached Keratinized Gingiva | OTU_97_10823 | 700037037 | 765094712 | 1 | Male | WUGC | Oral | SRS018597 | 53 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus |
| Buccal Mucosa | OTU_97_10165 | 700037036 | 765094712 | 1 | Male | WUGC | Oral | SRS018595 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Aerococcaceae | Abiotrophia |
| Hard Palate | OTU_97_1069 | 700016909 | 159389382 | 1 | Male | BCM | Oral | SRS013776 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces |
| Left Antecubital Fossa | OTU_97_10352 | 700016656 | 159166850 | 1 | Male | BCM,WUGC | Skin | SRS011553 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium |
| Left Retroauricular Crease | OTU_97_10043 | 700037027 | 765094712 | 1 | Male | WUGC | Skin | SRS018577 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium |
| Mid Vagina | OTU_97_1086 | 700024241 | 764346198 | 1 | Female | WUGC | Urogenital Tract | SRS015613 | 2 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Mycobacteriaceae | Mycobacterium |
| Palatine Tonsils | OTU_97_10400 | 700024552 | 764305738 | 1 | Male | WUGC | Oral | SRS015925 | 5 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus |
| Posterior Fornix | OTU_97_10296 | 700033151 | 159733294 | 1 | Female | BCM,WUGC | Urogenital Tract | SRS011619 | 4 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Right Antecubital Fossa | OTU_97_10623 | 700024406 | 764224817 | 1 | Male | WUGC | Skin | SRS015778 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium |
| Right Retroauricular Crease | OTU_97_10733 | 700016134 | 159146620 | 1 | Male | BCM,WUGC | Skin | SRS011474 | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
| Saliva | OTU_97_10849 | 700024339 | 764467579 | 1 | Female | WUGC | Oral | SRS015704 | 1 | Bacteria | Bacteroidetes | Flavobacteria | Flavobacteriales | Flavobacteriaceae | Capnocytophaga |
| Stool | OTU_97_10813 | 700015245 | 158802708 | 1 | Male | JCVI,WUGC | Gastrointestinal Tract | SRS011271 | 45 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides |
| Subgingival Plaque | OTU_97_10913 | 700016353 | 159005010 | 1 | Female | JCVI,WUGC | Oral | SRS011520 | 1 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella |
| Supragingival Plaque | OTU_97_10225 | 700024199 | 764285508 | 1 | Male | WUGC | Oral | SRS015574 | 4 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Micrococcaceae | Rothia |
| Throat | OTU_97_1088 | 700015075 | 158721788 | 1 | Male | JCVI,BI | Oral | SRS016012 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Micrococcaceae | Rothia |
| Tongue Dorsum | OTU_97_10321 | 700021312 | 763536994 | 1 | Male | WUGC | Oral | SRS014271 | 1 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces |
| Vaginal Introitus | OTU_97_10483 | 700015571 | 158944319 | 1 | Female | JCVI,BI | Urogenital Tract | SRS014444 | 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Dialister |
The unique() method across groups

Now that we know how to group our data by body subsite, what we really want to do is find the unique genera within each group. We already know we can use the unique() method, but as we'll see, the results are not quite what we expected. The nature of the grouped data means the following code generates an np.array object for each group.
# What happens if we use the unique() method?
merged_subset[merged_subset["count"]>0] \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.unique()
HMP_BODY_SUBSITE Anterior Nares [Propionibacterium, Dialister, Corynebacterium... Attached Keratinized Gingiva [Streptococcus, Gemella, Leptotrichia, nan, Po... Buccal Mucosa [Abiotrophia, Streptococcus, Actinomyces, Bull... Hard Palate [nan, Actinomyces, Prevotella, Catonella, Bull... Left Antecubital Fossa [Propionibacterium, Moryella, Clostridium, Bac... Left Retroauricular Crease [nan, Propionibacterium, Faecalibacterium, Sta... Mid Vagina [Mycobacterium, Finegoldia, Lactobacillus, Ato... Palatine Tonsils [Streptococcus, Peptostreptococcus, Granulicat... Posterior Fornix [Lactobacillus, Veillonella, Mycobacterium, An... Right Antecubital Fossa [Propionibacterium, Corynebacterium, nan, Acti... Right Retroauricular Crease [Bacteroides, Propionibacterium, Staphylococcu... Saliva [Capnocytophaga, Veillonella, Streptococcus, P... Stool [Bacteroides, Akkermansia, Clostridium, Rosebu... Subgingival Plaque [Prevotella, Neisseria, nan, Gemella, Actinomy... Supragingival Plaque [Rothia, Streptococcus, nan, Veillonella, Acti... Throat [Rothia, Actinomyces, nan, Veillonella, Porphy... Tongue Dorsum [Actinomyces, Streptococcus, Neisseria, Prevot... Vaginal Introitus [Dialister, Lactobacillus, Finegoldia, Anaeroc... Name: GENUS, dtype: object
The dropna() method to remove data with NA values

As we can see above, we have some NA values in our output. In order to avoid accidentally counting these values in our final analysis, we should remove them. There is an easy way to clean our entire dataset by removing NA values: using the dropna() method we can remove either rows or columns that contain a single NA value or are entirely NA. Some of the relevant parameters for us in this method are:
axis: Are you removing data by rows (0 or 'index') or by columns (1 or 'columns')?
how: Remove if 'any' (default) NA values are present, or only if 'all' values are NA.
thresh: Optional requirement for how many non-NA values a row/column must hold to be kept.
subset: Optional column label or sequence of labels to consider when identifying null values.
inplace: Boolean to modify the DataFrame you are calling from in place (helpful if NOT method chaining).
merged_subset[merged_subset["count"]>0] \
.dropna(axis = 0, subset = ["GENUS"]) \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.unique()
# merged_subset[merged_subset["count"]>0].dropna(axis=0).groupby('HMP_BODY_SUBSITE')["GENUS"].unique()
HMP_BODY_SUBSITE Anterior Nares [Propionibacterium, Dialister, Corynebacterium... Attached Keratinized Gingiva [Streptococcus, Gemella, Leptotrichia, Porphyr... Buccal Mucosa [Abiotrophia, Streptococcus, Actinomyces, Bull... Hard Palate [Actinomyces, Prevotella, Catonella, Bulleidia... Left Antecubital Fossa [Propionibacterium, Moryella, Clostridium, Bac... Left Retroauricular Crease [Propionibacterium, Faecalibacterium, Staphylo... Mid Vagina [Mycobacterium, Finegoldia, Lactobacillus, Ato... Palatine Tonsils [Streptococcus, Peptostreptococcus, Granulicat... Posterior Fornix [Lactobacillus, Veillonella, Mycobacterium, An... Right Antecubital Fossa [Propionibacterium, Corynebacterium, Actinomyc... Right Retroauricular Crease [Bacteroides, Propionibacterium, Staphylococcu... Saliva [Capnocytophaga, Veillonella, Streptococcus, P... Stool [Bacteroides, Akkermansia, Clostridium, Rosebu... Subgingival Plaque [Prevotella, Neisseria, Gemella, Actinomyces, ... Supragingival Plaque [Rothia, Streptococcus, Veillonella, Actinomyc... Throat [Rothia, Actinomyces, Veillonella, Porphyromon... Tongue Dorsum [Actinomyces, Streptococcus, Neisseria, Prevot... Vaginal Introitus [Dialister, Lactobacillus, Finegoldia, Anaeroc... Name: GENUS, dtype: object
Now we just need to use a function or method to determine the length of the array. Our choices, of course, are the len() function or the .size attribute. Let's try them both and see what we get.
# Calculate the length of our arrays
len(merged_subset[merged_subset["count"]>0] \
.dropna(axis = 0, subset = ["GENUS"]) \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.unique())
# Retrieve the size of our arrays
merged_subset[merged_subset["count"]>0] \
.dropna(axis = 0, subset = ["GENUS"]) \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.unique() \
.size
18
18
apply() method to broadcast a function to individual elements¶Why did both of our sets of code result in a single value - 18? This is actually the number of rows in our Series (i.e. the number of unique HMP_BODY_SUBSITE values). That's not at all what we want, but with our code as it is, that's what we asked for. In both cases, after our call to unique() we have generated a Series object. Calling len() or .size gives us information about the Series itself, but what we want is information about the values stored in that object.
Recall that DataFrame objects can broadcast mathematical operations: we can apply a mathematical operation to each individual element of a DataFrame. Likewise, we can broadcast more complex functions on individual elements by using the apply() method.
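As a quick sketch of that difference (with a made-up Series for illustration only), arithmetic broadcasts automatically, while apply() lets us broadcast any function we choose:

```python
import pandas as pd

# Arithmetic on a numeric Series broadcasts to every element
nums = pd.Series([1, 2, 3]) * 10

# apply() broadcasts an arbitrary function to each element instead;
# here each element is itself a list, and we ask for its length
def element_size(x):
    return len(x)

arrays = pd.Series([[1, 2], [3, 4, 5]])
sizes = arrays.apply(element_size)
```
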
The apply() method takes the form apply(func, args=(), **kwargs). We'll talk a little more about some of these parameters in later lectures, but for now:
- func: the name of the function you want to use.
- args: a tuple of any additional arguments that are needed for func to work.

In our case, we want to apply the .size attribute to each array in our Series. You'll see one more unfamiliar piece of code: lambda. This is how we can tell Python we want to make a quick function on the spot. We'll cover this more in a later lecture as well, so for now we'll just roll with it.
# Retrieve the size of our arrays
merged_subset[merged_subset["count"]>0] \
.dropna(axis = 0, subset = ["GENUS"]) \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.unique() \
.apply(lambda x: x.size) # We use the lambda notation to define simple, quick functions
HMP_BODY_SUBSITE Anterior Nares 20 Attached Keratinized Gingiva 24 Buccal Mucosa 29 Hard Palate 30 Left Antecubital Fossa 19 Left Retroauricular Crease 19 Mid Vagina 9 Palatine Tonsils 30 Posterior Fornix 10 Right Antecubital Fossa 13 Right Retroauricular Crease 18 Saliva 31 Stool 15 Subgingival Plaque 27 Supragingival Plaque 27 Throat 34 Tongue Dorsum 24 Vaginal Introitus 14 Name: GENUS, dtype: int64
nunique() on a grouped dataframe to return the number of unique elements¶Looking at our code above, we went through 5 steps to get a final answer: filter out the zero counts, drop the NA values, group by body subsite, collect the unique genera, and apply .size to each array.
What if, instead, we used a helpful method - nunique() - to count the number of unique elements in our grouped dataframe? This method simplifies the process by combining the unique() and len() steps we used previously to achieve the same goal. Furthermore, it ignores NA values by default. Using this method condenses our process into just 3 steps, as we'll see.
# Use the nunique() method to count unique values in a group
merged_subset[merged_subset["count"]>0] \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.nunique(dropna = True) \
.sort_values(ascending = False) # Don't forget to sort your values!
HMP_BODY_SUBSITE Throat 34 Saliva 31 Hard Palate 30 Palatine Tonsils 30 Buccal Mucosa 29 Supragingival Plaque 27 Subgingival Plaque 27 Tongue Dorsum 24 Attached Keratinized Gingiva 24 Anterior Nares 20 Left Antecubital Fossa 19 Left Retroauricular Crease 19 Right Retroauricular Crease 18 Stool 15 Vaginal Introitus 14 Right Antecubital Fossa 13 Posterior Fornix 10 Mid Vagina 9 Name: GENUS, dtype: int64
agg() function to generate a summary from multiple functions¶Rather than using just a single function like nunique() you can actually apply multiple functions to a grouped dataframe to generate a summary of the information. In this case we are still only looking at the GENUS column so we are a little limited by the kind of summary we can achieve from string data. We can, however, still count() the number of entries in each group.
To combine both of these methods to produce a summary, we'll use the agg() method, which will accept a list of function names like "count" and "nunique" but also "sum", "min", "max" and other methods found in the GroupBy object.
# Use the agg() method to apply multiple summary functions to a group
merged_subset[merged_subset["count"]>0] \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.agg(["count", "nunique"])
| count | nunique | |
|---|---|---|
| HMP_BODY_SUBSITE | ||
| Anterior Nares | 128 | 20 |
| Attached Keratinized Gingiva | 194 | 24 |
| Buccal Mucosa | 340 | 29 |
| Hard Palate | 300 | 30 |
| Left Antecubital Fossa | 47 | 19 |
| Left Retroauricular Crease | 135 | 19 |
| Mid Vagina | 36 | 9 |
| Palatine Tonsils | 378 | 30 |
| Posterior Fornix | 40 | 10 |
| Right Antecubital Fossa | 48 | 13 |
| Right Retroauricular Crease | 113 | 18 |
| Saliva | 293 | 31 |
| Stool | 223 | 15 |
| Subgingival Plaque | 390 | 27 |
| Supragingival Plaque | 431 | 27 |
| Throat | 340 | 34 |
| Tongue Dorsum | 348 | 24 |
| Vaginal Introitus | 54 | 14 |
So, after a couple of possible directions, we find that the throat has the most diverse set of genera across our samples, with 34 (out of 84 unique genera encountered). We also discovered that while methods like apply() give you a lot of flexibility in what you'd like to calculate with your data, sometimes there is already a specialized tool like nunique() that accomplishes what you want with less coding on your part!
After more practice and experience, some of these functions will become second nature in your coding choices.
# Sort and count entries based on Sex/Genus subgrouping
merged_subset[merged_subset["count"]>0] \
.groupby([...])["GENUS"] \
.agg(["count", "nunique"]) \
.sort_values(..., ascending = False)
pyplot module¶Now that we've had a chance to look at our data close up, let's talk about how we can use exploratory plots to give us a quick visual assessment of our data. We can use these visualizations to help make decisions about how to further analyse our data. Is there a difference between different groups of data? Does it look like there might be any bias between our datasets? What does the overall distribution of our sampling look like?
Often when trying to convey a message about our data through a visualization, we want to choose the right kind of visualization. These visualizations can also be referred to as figures or plots. Within the matplotlib package is the pyplot module, a collection of functions that give the matplotlib package capabilities very similar to those of the programming language MATLAB. The pyplot module has functions that can create some of the following basic plots:
| Plot type | Command | What to use it for |
|---|---|---|
| Bar plot / histogram | bar() | Population data summaries. Helpful for contrasting between groups |
| Scatter plot | scatter() | Multiple independent measurements across different variables |
| Line plot | plot() | Multiple measurements that represent the same sample(s) |
| Histogram | hist() | Generate a distribution by binning your data |
| Stem or lollipop | stem() | A twist on the bar plot that may be more compact and visually pleasing |
| Boxplots | boxplot() | Create a visual summary of your datapoints based on their distribution |
| Violin plots | violinplot() | Create a visual kernel density (distribution) estimate of datapoints |
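To see a couple of these calls in action before we build our own plots, here is a minimal sketch using made-up random data (the sample values are invented for illustration only):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample data for illustration only
rng = np.random.default_rng(42)
sample = rng.normal(loc=5, scale=2, size=200)

# hist() bins the data to show its distribution
plt.figure(figsize=(6, 3))
counts, bin_edges, _ = plt.hist(sample, bins=20)
plt.show()

# boxplot() summarizes the same datapoints by their quartiles
plt.figure(figsize=(6, 3))
plt.boxplot(sample)
plt.show()
```
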
Within each plot are a number of basic components: titles, axis properties, legends, etc. Here is a helpful table outlining some of the basic plot components.
| Component | Description | Command | Parameters |
|---|---|---|---|
| Title | The title of your plot | title() | |
| X- or Y-axis title | The axis titles of your plot | xlabel(), ylabel() | xlabel=str, loc={'left', 'center', 'right'}, text properties |
| X- or Y-axis ticks | Alter your axis tick positions/locations and labels | xticks(), yticks() | ticks=[a,..n], labels=[label1, ..., labeln] |
| Axis limits | A list defining the x- and y-axis limits | axis() | [xMin, xMax, yMin, yMax] |
| Axis scale | Set the kind of axis scale for your data | xscale(), yscale() | "linear", "log", "symlog", "logit" |
| Text properties | Labels can take text parameters too | | color, fontsize, fontstyle, rotation |
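The components in the table above can be tried out on a quick throwaway plot. This sketch (with made-up exponential data) sets a title, axis labels, explicit axis limits, and a log y-scale:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up exponential data for illustration
x = np.arange(1, 11)
y = 2.0 ** x

plt.plot(x, y)
plt.title("Exponential growth", fontsize="large")
plt.xlabel("Step")
plt.ylabel("Value", color="r")
plt.axis([0, 12, 1, 2048])   # [xMin, xMax, yMin, yMax]
plt.yscale("log")            # a log scale straightens exponential growth

ax = plt.gca()               # grab the current axes to inspect our settings
plt.show()
```
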
bar() method¶We'll use our body subsite results as an example to try and plot some of our data as a bar plot. The bar() method generally requires two sets of data to be supplied along with some optional data:
- x: An array of x-coordinates (group labels, or x values).
- height: A float or array of bar heights - these are usually the measured/summarized values.
- width: A float or array of bar widths (default is 0.8).
- bottom: The y-coordinates of the bars (default is 0).
- align: The alignment of the bars to your x-coordinate labels (default is center).

Let's start by building a basic barplot and see what needs to be altered as we move forward. Note that we use the plt.show() function to display our plot after putting all the pieces together.
# We'll reset the code cells to only show the last code call. This will de-clutter the plotting process for us
InteractiveShell.ast_node_interactivity = "last"
# Begin by setting our dataset for use
body_diversity = \
merged_subset[merged_subset["count"]>0] \
.groupby('HMP_BODY_SUBSITE')["GENUS"] \
.nunique(dropna = True) \
.sort_values(ascending = False) # Don't forget to sort your values!
# Build our barplot by giving the index as the x-label, and values as the height
plt.bar(x=body_diversity.index,
height=body_diversity)
# Show our plot
plt.show()
figure() to set your figure size¶As we can see from our first attempt, the plot is rather small. We definitely have some problems with the basics of this plot, so let's address the first one: the figure could be a little larger so we can read the x-axis labels better. Use the figure() function to set your figure size via the figsize parameter.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Use a barplot
plt.bar(x=body_diversity.index,
height=body_diversity)
# Show our plot
plt.show()
xticks() method¶Now that our plot is larger, we can see the x-axis labels are still overcrowded. We can, however, rotate the text and see if that helps. Let's rotate to a 90-degree angle. We'll alter these axis properties through the xticks() method.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Use a barplot
plt.bar(x=body_diversity.index,
height=body_diversity)
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Show our plot
plt.show()
Okay, the plot is larger and we've fixed our x-axis label issues. No more overcrowding! Let's add a main title and axis titles to our dataset. We can use the title(), xlabel(), and ylabel() methods in this case. While we set the labels, we can also set their properties such as fontsize, fontstyle, and color.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Use a barplot
plt.bar(x=body_diversity.index,
height=body_diversity)
# rotate the text
plt.xticks(rotation = 90)
# Add titles
plt.title("OTU genera diversity by body subsite", fontsize = "x-large")
plt.xlabel("Body subsite", fontstyle = "italic")
plt.ylabel("Unique genera", color = "r")
# Show our plot
plt.show()
Similar to altering text properties, most of the plots have the ability to alter various properties like fill and line color or other plot-specific attributes. Let's update the fill and line colours for our barplot using the color and edgecolor parameters.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Use a barplot
plt.bar(x=body_diversity.index,
height=body_diversity,
color="orchid",
edgecolor="black"
)
# rotate the text
plt.xticks(rotation = 90)
# Add titles
plt.title("OTU genera diversity by body subsite", fontsize = "x-large")
plt.xlabel("Body subsite", fontstyle = "italic")
plt.ylabel("Unique genera", color = "r")
# Show our plot
plt.show()
?plt.axis
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Use a barplot
plt.bar(x=body_diversity.index,
height=body_diversity,
color="orchid",
edgecolor="black"
)
# rotate the text
plt.xticks(rotation = 90)
# Add titles
plt.title("OTU genera diversity by body subsite", fontsize = "x-large")
plt.xlabel("Body subsite", fontstyle = "italic")
plt.ylabel("Unique genera", color = "r")
# Alter the axis limits
plt.axis([...])
# Show our plot
plt.show()
seaborn package¶Building upon our visualizations in the last section, there are some common themes you might recognize about them. We have a plot area, x- and y-axis data, axis limits, and plot colors. Using matplotlib to help generate your visualizations, you can control many small details but it can also be tedious at times to micromanage so many aspects of your plot.
The seaborn package is actually built upon the pyplot module and tries to bring a high-level approach to statistical plots. As we'll see later on, this means updating certain details of our plots will require an understanding of the base matplotlib and pyplot functions.
seaborn package subdivides plot types into 3 categories¶The seaborn package takes a dual-pronged approach to generating plots. There are functions considered to work at the Figure level and then there are functions that affect what is known as the Axes level. To simplify the concept:
- Axes: A single plot defined by an x- and y-axis grid. This includes all of the basic plots like scatter and box plots.
- Figure: A plot space that can contain from 1 to multiple Axes. The arrangement of Axes can range from simple to complex.

At the Axes level, there are 3 categories of plot types based on their similarity: relational, distribution, and categorical. For each of these categories there is a figure-level function that can be used to create multi-panel (faceted) versions of these plots by splitting the data further by categorical variables.
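As a small illustration of the two levels (using a tiny made-up DataFrame, not the course dataset), an axes-level call returns a single matplotlib Axes, while a figure-level call returns a FacetGrid that can hold several:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny made-up DataFrame for illustration only
df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 1, 3],
                   "group": ["a", "a", "b", "b"]})

# Axes-level: draws onto a single matplotlib Axes
ax = sns.scatterplot(data=df, x="x", y="y", hue="group")

# Figure-level: returns a FacetGrid managing its own figure,
# which can split the data into one Axes per group
grid = sns.relplot(data=df, x="x", y="y", col="group", kind="scatter")
plt.show()
```
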
From the seaborn overview: for most simple plots, one of the figure- or axes-level functions above can be utilized.
The above functions are used to initialize figure and axes objects by identifying a number of properties. Within these, some of the options can vary greatly based on plot type. Of the two levels of functions, their influence on figure attributes can vary:
- Figure-level functions return a FacetGrid object that has some additional methods for altering attributes of the plot in a way that makes sense to the subplot organization.
- Axes-level functions affect the Axes they are drawn onto but do not alter the figure in any other way. You can choose to draw onto the current axes in memory OR specify a reference to an axes which may be within a larger figure.

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:
The grammar of graphics is a language to communicate about what we are plotting programmatically
It begins with a tidy data frame. It will have a series of observations (rows) each of which will be described across multiple variables (columns). Variables can actually represent qualitative or quantitative measurements or they could be descriptive data about the experiments or experimental groups.
The data units may undergo conversion through a process called scaling (transformation) before being used for plotting.
A subset of data columns are then passed on to be presented in various data plots (scatterplots, boxplots, kernel density estimates, etc.) by using the data to describe visual properties of the plot. We call these visual properties, the aesthetics of the plot. For example, the data being plotted or represented can be visually altered in shape or colour based on accompanying column data.
A plot can have multiple layers (for example, a scatter plot with a regression line) and each of these plot types is referred to as a geom (short for geometric object).
seaborn¶The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of R tidyverse fame has proposed a variant on this concept - the layered grammar of graphics - implemented in the ggplot2 package for R. By following a layered approach of defined components, it becomes easy to build a visualization.
In a similar manner, the seaborn package has some methods that facilitate a layering approach to building your visualizations. However, many of the details are built upon the foundation of layering Axes objects or alterations upon Figure objects.
Each Axes-level function usually takes in:
Data: your visualization always starts here. What are the dimensions you want to visualize. What aspect of your data are you trying to convey?
Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.
Geometric objects: how will you display your data within your visualization? Which *plot() function will you use?
The figure-level methods can be used to alter or update:
Scale: do you need to alter your x- or y-axis limits? What about scaling/transforming any values to fit your data within a range? Sometimes, depending on the geometric object, you are better off transforming your data ahead of time.
Facets: will generating subplots of the data add a dimension to your visualization that would otherwise be lost?
Coordinate system: will your visualization follow a classic Cartesian, semi-log, polar, etc. coordinate system?
Let's jump into our first dataset and start building some plots with it shall we?
metadata_pitlatrine¶Before we dig into the seaborn package, we will import a new dataset that has a much more diverse array of data that we can use for showcasing the plotting power of seaborn.
# We'll keep the code cells set to only show the last code output
InteractiveShell.ast_node_interactivity = "last"
# Read the pitlatrine data in from file
pitlatrine_data = pd.read_csv("data/metadata_pitlatrine.csv")
# Look at the first 5 rows of data
pitlatrine_data.head(5)
| Country | Latrine_Number | Depth | pH | Temp | TS | CODt | |
|---|---|---|---|---|---|---|---|
| 0 | Tanzania | 2 | 1 | 7.82 | 25.1 | 14.53 | 874 |
| 1 | Tanzania | 2 | 10 | 9.08 | 24.2 | 37.76 | 102 |
| 2 | Tanzania | 2 | 12 | 8.84 | 25.1 | 71.11 | 35 |
| 3 | Tanzania | 2 | 2 | 6.49 | 29.6 | 13.91 | 389 |
| 4 | Tanzania | 2 | 3 | 6.46 | 27.9 | 29.45 | 161 |
# How big is this dataframe?
pitlatrine_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 81 entries, 0 to 80 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 81 non-null object 1 Latrine_Number 81 non-null int64 2 Depth 81 non-null int64 3 pH 81 non-null float64 4 Temp 81 non-null float64 5 TS 81 non-null float64 6 CODt 81 non-null int64 dtypes: float64(3), int64(3), object(1) memory usage: 4.6+ KB
# What are the unique values in each column?
pitlatrine_data.apply(pd.unique, axis=0)
Country [Tanzania, Vietnam] Latrine_Number [2, 3, 4, 5, 6, 9, 1, 10, 11, 12, 13, 14, 15, ... Depth [1, 10, 12, 2, 3, 6, 7, 9, 5, 4, 8] pH [7.82, 9.08, 8.84, 6.49, 6.46, 7.69, 7.48, 7.6... Temp [25.1, 24.2, 29.6, 27.9, 28.7, 29.8, 25.0, 28.... TS [14.53, 37.76, 71.11, 13.91, 29.45, 65.52, 36.... CODt [874, 102, 35, 389, 161, 57, 107, 62, 384, 372... dtype: object
Taking a quick look at our data, we can see a few categorical variables to explore (like Country and Depth) alongside measured variables like pH, Temp, TS (total solids), and CODt (total chemical oxygen demand).
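Before plotting, a quick numeric summary per group can guide what to visualize. Here is a hedged sketch of that idea using a few made-up rows that mimic the metadata_pitlatrine.csv column layout (the values are invented, not the real data):

```python
import pandas as pd

# Made-up rows mimicking the metadata_pitlatrine.csv columns
demo = pd.DataFrame({
    "Country": ["Tanzania", "Tanzania", "Vietnam", "Vietnam"],
    "Depth":   [1, 10, 2, 6],
    "pH":      [7.82, 9.08, 6.49, 7.69],
    "Temp":    [25.1, 24.2, 29.6, 28.7],
})

# Mean of each measured variable per Country
summary = demo.groupby("Country")[["pH", "Temp"]].mean()
print(summary)
```
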
relplot() method¶We'll start by building a basic scatterplot. We'll focus on comparing the total chemical oxygen demand versus the total solids count. Rather than working at the axes-level, we'll work with the encompassing relplot() method which will give us flexibility in our visual exploration down the road.
For the basic plot relplot(), we'll start with the following parameters:
- data: The tidy (long-form) data set we want to visualize.
- x, y: The variable names for assigning x- and y-axis values.
- height: The total height of our figure.
- aspect: A scalar value used to determine the width of your figure (width = height * aspect).
- kind: The type of plot we want to produce: scatter (default) vs line.
- kwargs: This is a catch-all for any other keyword arguments that could be passed on to underlying functions like the Axes-level methods.

# Import your seaborn package
import seaborn as sns
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data, # Set the data
height = 6, aspect = 1, # Set the size of the figure
kind = "scatter" # Set the figure type
)
# Show the plot
plt.show()
Looking at the output, we can see that without having specified the x and y axis values, it simply plotted all of the variables against a default x-axis of row index (0 through 80). Let's try again by setting our axis variables.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt", # Set the axis variables
height = 6, aspect = 1,
kind = "scatter"
)
# Show the plot
plt.show()
hue parameter¶Now we begin to see the power of having a tidy DataFrame. Since each of our observations is in its own row, we can classify each observation by factors like Country! Using the seaborn package, we can specify the colour of our points using the hue parameter. Since we have set the data = pitlatrine_data parameter, we can tell seaborn to look at a specific column when determining the hue.
At the same time, we'll set the alpha parameter which is essentially the opacity of each datapoint. Setting a lower value increases transparency which allows us to see overlapping datapoints better. This parameter becomes especially helpful when working with extremely dense datapoints.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country", # Set the point-colour by Country
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 1,
kind = "scatter"
)
# Show the plot
plt.show()
set() method¶The set() method is a gateway to altering a number of aspects of your plot. Once we have our plot saved as an object named snsPlot, we can alter or set some of its properties this way. In particular, we will use the yscale parameter to change the y-axis to a log scale.
Note that our plot object is a seaborn FacetGrid whose set() method passes these attributes along to the underlying matplotlib.axes.Axes objects. We can set quite a few figure attributes through this method.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country",
alpha = 0.6,
height = 6, aspect = 1,
kind = "scatter"
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
relplot() method to facet your data¶If we want to split our data into multiple new plots based on certain variables, this is known as faceting your data. Usually this results in a grid-like pattern where data is grouped by the categories of one variable as columns and another variable as rows, though the data could also be split on a single variable. Either way, this generates a figure-level object known as a FacetGrid.
The relplot() method already has the capability to handle this splitting of axes within the figure it generates. The relplot() method can facet data across two variables using the row and col parameters. We simply need to name the variable(s) that will be used to categorize the data.
To summarize, the parameters to use for this operation are:
- col: The variable name that will be used to group the columns of your grid.
- row: The variable name that will be used to group the rows of your grid.

Below, we'll remove the colouring of points based on Country and instead split the data into two Axes based on this information.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
alpha = 0.6,
height = 6, aspect = 0.6, # Set the size of the figure
kind = "scatter",
col = "Country" # Split the columns of the grid by country
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
Note that we typically need to apply just one attribute to each dimension of data we are investigating. By splitting the data by Country, we no longer need to colour it based on this category. We can, however, add additional information to our visualization by using another dimension in our data. Instead of colouring the points based on a categorical variable (Tanzania vs Vietnam), we can use a continuous variable like Temp from our dataset to see if there could be a trend in relation to our data.
It is easy enough to set this dimension using the hue parameter in our initial relplot() call. We'll also set the palette parameter to a different colour scheme.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Temp", palette = "Reds", # Set the colouring based on Temp
alpha = 0.6,
height = 6, aspect = 0.6,
kind = "scatter",
col = "Country" # Split the columns of the grid by country
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
By colouring our datapoints by temperature, we can now see more clearly on the same plot that latrine samples from Tanzania are generally measured at a higher temperature. Rather than generating additional plots comparing different pairs of variables, we've simply added an additional dimension of information to our visualization.
set_axis_labels() method¶The axis titles are drawn from the variable names of the original DataFrame, but we may be limited in how those variables were originally named. In other cases you may wish to add units, or simply make your axis titles more descriptive. To accomplish this, we can alter our labels directly using the set_axis_labels() method. The parameters to set are x_var and y_var, in that order. Set them by keyword if you only want to change a single axis title.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Temp", palette = "Reds", # Set the colouring based on Temp
alpha = 0.6,
height = 6, aspect = 0.6,
kind = "scatter",
col = "Country" # Split the columns of the grid by country
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels("Total solids", "Total chemical oxygen demand")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
style parameter¶We'll switch gears a little at this point and ask what our data looks like when we compare the Depth of our latrine measurements instead. There are many depth values (11 total), which would be very hard to read if presented in a single row, so we'll also use the col_wrap parameter to set the number of facets per row. Note that col_wrap is not compatible with faceting by both a column and a row variable.
We'd still like to distinguish between our two different Country values so we'll assign the shape of these values using the style parameter instead.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Temp", palette = "Reds", # Set the colouring based on Temp
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
col_wrap = 4, # Keep the number of facets to 4 per row
alpha = 0.6,
height = 4, aspect = 0.5,
kind = "scatter",
col = "Depth" # Split the columns of the grid by Depth
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels("Total solids", "Total chemical oxygen demand")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
Now that we have some of the basics, it's time to take a closer look at other types of plots. We'll begin by loading the latrine OTU dataset into memory as the variable latrine_OTU_data and reviewing it.
# Load the pitlatrine data
latrine_OTU_data = pd.read_csv("data/taxa_pitlatrine.csv")
latrine_OTU_data.head()
| Taxa | Country | Latrine_Number | Depth | OTUs | |
|---|---|---|---|---|---|
| 0 | Acidobacteria_Gp1 | Tanzania | 2 | 1 | 0 |
| 1 | Acidobacteria_Gp10 | Tanzania | 2 | 1 | 0 |
| 2 | Acidobacteria_Gp14 | Tanzania | 2 | 1 | 0 |
| 3 | Acidobacteria_Gp16 | Tanzania | 2 | 1 | 0 |
| 4 | Acidobacteria_Gp17 | Tanzania | 2 | 1 | 0 |
latrine_OTU_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4212 entries, 0 to 4211 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Taxa 4212 non-null object 1 Country 4212 non-null object 2 Latrine_Number 4212 non-null int64 3 Depth 4212 non-null int64 4 OTUs 4212 non-null int64 dtypes: int64(3), object(2) memory usage: 164.7+ KB
kdeplot() or displot()¶There are a lot of datapoints in our new dataset. This time our measurements are based on OTU counts for specific taxa within different latrines across Tanzania and Vietnam. We already looked at a lot of the metadata about each latrine sample in pitlatrine_data from our previous plots. A quick question we can ask is: "what is the overall distribution of OTU counts in our data?"
A quick way to answer this question is by generating a kernel density estimate (KDE) using the displot() method from seaborn. We'll need to provide an x value (OTUs in this case) and we'll colour our plots based on Country.
# We'll build a KDE plot
snsPlot = sns.displot(data = latrine_OTU_data,
x = "OTUs",
hue = "Country",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot
plt.show()
# What are some general statistics about our data?
latrine_OTU_data.describe()
| Latrine_Number | Depth | OTUs | |
|---|---|---|---|
| count | 4212.000000 | 4212.000000 | 4212.000000 |
| mean | 9.271605 | 2.950617 | 163.224122 |
| std | 6.222985 | 2.205201 | 875.666289 |
| min | 1.000000 | 1.000000 | 0.000000 |
| 25% | 4.000000 | 1.000000 | 0.000000 |
| 50% | 8.000000 | 2.000000 | 0.000000 |
| 75% | 14.000000 | 4.000000 | 7.000000 |
| max | 22.000000 | 12.000000 | 17572.000000 |
Well, it looks like the majority of our distribution is close to a value of 0, and our OTUs can range as high as 17,572! Let's simplify the dataset by filtering it before plotting with displot(). In this case we'll keep OTU values between 2 and 1000 by combining two conditionals with the & operator. Be sure to use parentheses correctly here to ensure that each expression is evaluated and separated properly for the Python kernel.
# Filter your data range
latrine_filtered = latrine_OTU_data[(latrine_OTU_data["OTUs"]>2) &
(latrine_OTU_data["OTUs"]<1000)]
# We'll build a KDE plot
snsPlot = sns.displot(data = latrine_filtered,
x = "OTUs",
hue = "Country",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot
plt.show()
rugplot() to the margin

Within seaborn there are a few ways marginal plots can be added to your visualizations. Marginal plots usually add a distribution summary like a histogram, KDE, or in our case, a rugplot. More specifically, we'll be using the rugplot() method, but certain plot combinations can also be created with the jointplot() method.

A rugplot is simply a series of vertical or horizontal tick marks representing our actual data points along the x- and/or y-axis. For our rugplot, we'll place it outside of the plot area by manipulating the height and clip_on parameters. We'll also set the alpha parameter to help us see the density of our tick marks a little better. Regardless of where it is drawn, we are adding this plot to the underlying Axes object of the current snsPlot figure.
# Filter your data range
latrine_filtered = latrine_OTU_data[(latrine_OTU_data["OTUs"]>2) & (latrine_OTU_data["OTUs"]<1000)]
# We'll build a KDE plot
snsPlot = sns.displot(data = latrine_filtered,
x = "OTUs",
hue = "Country",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Add a rugplot - this is plotted on top of the current axes object
sns.rugplot(data = latrine_filtered,
x = "OTUs",
height = -0.02, clip_on = False,
alpha = 0.5
)
# Show the plot
plt.show()
Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.
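The whisker rule described above can be checked numerically; a minimal sketch with a made-up array (the data and values here are purely illustrative):

```python
import numpy as np

# Toy data with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and the inter-quartile range (IQR)
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # 7.75 - 3.25 = 4.5

# Whiskers extend to the most extreme point within 1.5*IQR of the box;
# anything beyond those limits is drawn as an individual outlier point
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3)  # 3.25 5.5 7.75
print(outliers)        # [100] - only the extreme value lies past the whisker
```

This is exactly the summary a boxplot draws for each category, just computed by hand.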
We are going to use boxplots to see the distribution of OTUs per Taxa across all samples.
catplot() method

To build the basic boxplot we begin with the main variables. We want to summarize the OTU counts (y-axis) for each taxon (x-axis) to get a sense of the OTU count distribution per taxon. We can use the catplot() method to build our visualization. You'll note that the plot is automatically coloured to differentiate between the x-axis groups.

The catplot() method is the gateway to more categorical plots and behaves similarly to the other two figure-level plots we've encountered.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = latrine_OTU_data,
x = "Taxa", y = "OTUs",
kind = "box", # Make a boxplot
height = 6, aspect = 2 # Set the height to 6 and the width to 12 (height * aspect)
)
# Show the plot
plt.show()
pyplot with xticks()

We've encountered this problem before and the solution is actually the same. Recall that seaborn is built on top of the pyplot module, so we can modify the plot directly through pyplot!
While we're here, let's update the y-axis scale as we have before using the set() method on our figure object.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = latrine_OTU_data,
x = "Taxa", y = "OTUs",
kind = "box", # Make a boxplot
height = 6, aspect = 2 # Set the height to 6 and the width to 12 (height * aspect)
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Transform the y-axis scale
snsPlot.set(yscale="log")
# Show the plot
plt.show()
Uh oh! Something strange has happened to our data. Recall that our data is filled with 0 counts! When we try to log-transform a 0, what is the result? Call np.log(0) to convince yourself that you've created a problem with your visualization.
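If you want to convince yourself, here's a minimal check (the errstate context just silences NumPy's divide-by-zero warning):

```python
import numpy as np

# The logarithm of zero is undefined; NumPy returns negative infinity
with np.errstate(divide="ignore"):  # suppress the divide-by-zero RuntimeWarning
    result = np.log(0)

print(result)            # -inf
print(np.isinf(result))  # True
```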
The problem is that the boxplot statistics were calculated before the axis transformation: the summary values are merely transformed and redrawn on the new scale. With the -inf values produced by our 0 counts, those statistics can no longer be calculated correctly.

Therefore we must filter out the 0 values beforehand to avoid this issue. We'll sort our taxa as well to get them in alphabetical order for our boxplot. On top of this, however, you might surmise that our interquartile ranges, when warped along a log-transformed scale, might not make sense anymore. Instead, we should log-transform our data before plotting so that the summary statistics can be properly calculated on that particular scale.
To accomplish this data transformation and creation of a new column, we'll use the assign() method, which allows us to create a new column within the DataFrame using the syntax `DataFrame.assign(new_col=expression_calculation)`.
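As a small illustration of assign() on a made-up DataFrame (passing a lambda lets the expression refer to the frame being built, rather than to an outside variable that may hold stale data):

```python
import numpy as np
import pandas as pd

# A tiny, made-up DataFrame for illustration
df = pd.DataFrame({"OTUs": [1, 10, 100]})

# assign() returns a NEW DataFrame with the extra column; the original is untouched.
# The lambda receives the frame it is called on, so the expression always
# operates on the rows that survived any filtering chained before it.
df_logged = df.assign(OTU_log=lambda d: np.log10(d["OTUs"]))

print(df_logged["OTU_log"].tolist())  # [0.0, 1.0, 2.0]
```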
# Filter the data
latrine_filtered = latrine_OTU_data[(latrine_OTU_data["OTUs"]>0)] \
.assign(OTU_log=lambda df: df.loc[:,"OTUs"].apply(...)) \
.sort_values("Taxa") # Sort your data by Taxa
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
kind = "box", # Make a boxplot
height = 6, aspect = 2 # Set the height to 6 and the width to 12 (height * aspect)
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot
plt.show()
hue parameter

We already know that our data is measured across two countries, so we can take advantage of this information to create a nested (paired/grouped) set of boxplots. When you have a smaller number of categories, this allows you to more directly compare the characteristics of your two populations. Let's see what happens when we use the hue parameter to distinguish between our country-level data.
We'll also set the boxplot() parameter width to put a little more distance between each category along the x-axis. This value usually ranges between 0 and 1.
To save on some space we'll additionally move the legend into the plot using the legend_out boolean parameter.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = ...,
kind = "box",
height = 6, aspect = 2,
hue = ..., # Set the boxplot nesting by Country
legend_out = ..., # Move the legend inside the plot
width = 0.6 # Put some more distance between categories by decreasing width
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot
plt.show()
row and col parameters

So our plot above is quite busy. Whereas previously we used the relplot() method to generate a faceted scatterplot (or relational plot), here we'll use the catplot() method to accomplish something similar. The catplot() method handles the distribution or faceting of categorical plots using similar parameters:

col: The variable name that will be used to group the columns of your grid.
row: The variable name that will be used to group the rows of your grid.

This time around we'll facet our data into a stacked set of plots using the row parameter. You'll notice that since we are no longer grouping by hue, the legend will also disappear.
It's time we fixed that y-axis label as well using the set_ylabels method. We should convey that the data scale is log10.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
kind = "box",
height = 4, aspect = 3, # We'll alter the sizes of our plots
row = ... # Generate facets by row
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Change our y-axis label
snsPlot...("Operational taxonomic units (log 10)")
# Show the plot
plt.show()
swarmplot()

Even though boxplots give us summary statistics on our data, it is useful to readers (and reviewers) to be able to see where our individual data points are. We've already used rugplot() to help visualize our data distribution in density plots.
Similarly, for a boxplot we can add the data as another layer using swarmplot() to place dots on top of our boxplot. A swarmplot places data points that are overlapping next to each other, so we can get a better picture of the distribution of our data.
For simplicity we will subset the data to a boxplot with 3 Taxa so that we can see the difference a complementary layer like this makes.
# Start by filtering your data
latrine_filtered = latrine_OTU_data[(latrine_OTU_data["OTUs"]>0) &
# filter for specific taxa
(latrine_OTU_data["Taxa"]...(["Bacilli", "Clostridia", "Unknown"]))
] \
.assign(OTU_log=lambda df: df.loc[:,"OTUs"].apply(np.log10)) \
.sort_values("Taxa") # Sort your data by Taxa
# Add our boxplot from before
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country",
kind = "box",
height = 6, aspect = 2 # We'll alter the sizes of our plots
)
# Change our y-axis label
snsPlot.set_ylabels("Operational taxonomic units (log 10)")
# Show the plot
plt.show()
hue when using grouped boxplots

Now that we have our base boxplot, we'll add the swarmplot as a layer. It's important to note that we are adding this layer after our initial call to catplot() but before we update our axis label. You can try on your own to see what happens if we add the swarmplot data after re-labeling the y-axis.
A couple of notes about the parameters we are using with the swarmplot:
hue: We are setting this parameter to ensure that our points are coloured within the Taxa category by Country.
dodge: This boolean is required to signal that we want to separate our points (based on hue) into their own groups.
color: We'll take advantage of this parameter to set all points back to black for contrast.

We'll also update the fliersize parameter in our catplot() to hide our outliers. Otherwise we'll be plotting those points twice when we add the swarmplot.
# Add our boxplot from before
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country",
kind = "box",
height = 6, aspect = 2, # We'll alter the sizes of our plots
legend_out = False,
fliersize = ... # Hide our outliers by making them size 0
)
# Add the swarmplot to the current axes
sns.swarmplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = ..., dodge = ..., # colour and split the points by Country
color = ..., # Recolour all of the points to black
)
# Change our y-axis label
snsPlot.set_ylabels("Operational taxonomic units (log 10)")
# Show the plot
plt.show()
If you could combine aspects of the boxplot and the KDE into a single visualization, you would get the violin plot. Another way to think of the violin plot is as a KDE that has been shrunk down and placed categorically.
It's actually quite easy to switch over since many of the aspects are similar to the boxplot. We need only change the kind parameter in our catplot() code.
# Alter our boxplot from before
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country",
kind = ..., # It's a simple switch in the kind of plot!
height = 6, aspect = 2,
legend_out = False
)
# Add the swarmplot to the current axes
sns.swarmplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country", dodge = True, # colour and split the points by Country
color = "black" # Recolour all of the points to black
)
# Change our y-axis label
snsPlot.set_ylabels("Operational taxonomic units (log 10)")
# Show the plot
plt.show()
split parameter

Sometimes a more direct comparison of your data can be made by generating a split version of the violin plot. This is especially helpful when you are working with nested data that is binary and you would like to compare it visually.
We'll initialize this visualization with the split boolean parameter. To help with this visualization we'll also include some inner markers of the quartile information in each violin.
# Alter our boxplot from before
snsPlot = sns.catplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country",
kind = "violin", # It's a simple switch in the kind of plot!
height = 6, aspect = 1.5,
legend_out = False,
split = ..., # This will create hybrid violin plots
inner = ... # Add quartile markers to each half of the violin
)
# Change our y-axis label
snsPlot.set_ylabels("Operational taxonomic units (log 10)")
# Show the plot
plt.show()
Up until now, we have taken for granted that our plots have been displayed on screen. In our Jupyter Notebooks we can see the graphs right away and update our code. You can even save them manually from the output display, but sometimes you may be producing multiple visualizations based on large datasets. In that case it is preferable to save them directly to file.
plt.savefig() method

Once you have a figure the way you want it, you can save it in any number of graphical and non-graphical formats. The savefig() method from the pyplot module is here to save the day. When saving the current figure, you can use some of the following parameters:

fname: The path to the file you want to save, including the extension. If format is not set, the file extension will be used to infer the format instead.
dpi: The resolution in dots per inch for your figure.
format: The file format you'd like to use. Supported filetypes include svg, jpg, eps, and pdf.

Let's save our faceted scatterplot from way back in section 3.2.6.
# We'll build a scatterplot
snsPlot = sns.relplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Temp", palette = "Reds", # Set the colouring based on Temp
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
col_wrap = 4, # Keep the number of facets to 4 per row
alpha = 0.6,
height = 4, aspect = 0.5,
kind = "scatter",
col = "Depth" # Split the columns of the grid by Depth
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels("Total solids", "Total chemical oxygen demand")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Save the plot
plt.savefig(...
)
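As a standalone sketch of savefig() (the figure and the filename here are made up for illustration; the Agg backend just avoids needing a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders without a screen
import matplotlib.pyplot as plt

# A trivial figure to demonstrate saving to file
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Hypothetical filename; the format is inferred from the .png extension
plt.savefig("my_scatterplot.png", dpi=300)
plt.close(fig)
```

Swapping the extension to .svg or .pdf would produce a vector version instead, with no other code changes.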
Up until now we've been generating a combination of either faceted plots or simply layering elements upon single plots. Throughout all of these we have not really been mixing the types of plots we've generated. Luckily for us, the matplotlib.pyplot module provides a means for us to put together multiple plot axes in a single figure.
We have already seen some of these figure-level functions in action with relplot() and catplot(), which provided an interface to axes-level methods like scatterplot() or boxplot() when creating faceted plots.
| A handy figure from the seaborn overview at: https://seaborn.pydata.org/tutorial/function_overview.html |
What if, however, we would like to create a figure with multiple axes of different types?
subplot2grid() to generate a multi-grid plot

When considering the layout of our data, we want to think of breaking up the figure into a grid. This can start off simply as a 1x1 panel and expand outwards with nested panels within a 2x2 or 3x3 or larger figure. The dimensions are not limited to square shapes but can be rectangular as well.
The subplot2grid takes the following parameters:
shape: The dimensions of the figure given as a (numRows, numCols) tuple.
loc: The location of the subplot (Axes object) you are creating. Position (0, 0) is the top left corner.
rowspan: The height of the subplot in number of rows.
colspan: The width of the subplot in number of columns.
fig: A figure object to place the Axes object in. Otherwise the current figure is used.

| Some simple layouts demonstrating how a figure can be subdivided |
Let's make a figure with 3 plots as seen in the 4th example above. We'll begin just by generating the specific layout.
# Initialize a figure
fig = plt.figure()
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 2), (...), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 2), (...), colspan=1)
ax3 = plt.subplot2grid((2, 2), (...), colspan=1)
plt.show()
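To see how rowspan and colspan shape a grid, here's a sketch of a different arrangement than the one used in this lecture: one wide panel on top and two smaller panels beneath it (the layout is chosen purely for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))

# One wide panel spanning both columns of the top row
ax_top = plt.subplot2grid((2, 2), (0, 0), colspan=2)
# Two single-cell panels along the bottom row
ax_bottom_left = plt.subplot2grid((2, 2), (1, 0))
ax_bottom_right = plt.subplot2grid((2, 2), (1, 1))

print(len(fig.axes))  # 3 - each call registered one Axes on the figure
```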
Axes objects as a canvas to plot onto fig

Now you can see that our figure encompasses the 3 panels that we envisioned. You'll notice that we named each panel with a reference/variable. We could have added them to a single list object to save on variable names, but to simplify our understanding they've been named separately.
Why do we need these objects? Each axes-level function we use to plot with can take a parameter ax, where we pass along an Axes object. This identifies which panel we want to plot onto in our overall figure. Otherwise it will use the most recently generated Axes object (i.e. ax3). We'll fill our Axes as follows:

ax1: A scatterplot of our pitlatrine data using CODt vs TS.
ax2: A violin plot with swarmplot overlay of our filtered latrine data.
ax3: A KDE plot of the overall OTU counts in our latrine data contrasted between our two countries.

There is, however, a quick hitch. We can no longer rely directly on the figure-level functions relplot(), catplot(), and displot(). In order to plot correctly on these subpanels, we'll need to use the axes-level function calls instead. We'll add our plots one at a time so you can see the effect of each.
# Regenerate our filtered latrine data just in case something has happened to it
latrine_filtered = latrine_OTU_data[(latrine_OTU_data["OTUs"]>0) &
# filter for specific taxa
(latrine_OTU_data["Taxa"].isin(["Bacilli", "Clostridia", "Unknown"]))
] \
.assign(OTU_log=lambda df: df.loc[:,"OTUs"].apply(np.log10)) \
.sort_values("Taxa") # Sort your data by Taxa
# Initialize a figure
fig = plt.figure(figsize=(10, 8))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 2), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 2), (0, 1), colspan=1)
ax3 = plt.subplot2grid((2, 2), (1, 1), colspan=1)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
sns.scatterplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country", # Set the colouring based on Country
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
alpha = 0.6,
ax = ...
)
# Directly set the axes y-axis label
...set_ylabel("Total chemical oxygen demand")
# Show the plot
plt.show()
Populating the second Axes

Next we'll populate the second panel (top-right) with a combination of violinplot() and swarmplot(). We'll adjust the swarmplot slightly to make the points smaller using the size parameter.
# Initialize a figure
fig = plt.figure(figsize=(10, 8))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 2), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 2), (0, 1), colspan=1)
ax3 = plt.subplot2grid((2, 2), (1, 1), colspan=1)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
sns.scatterplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country", # Set the colouring based on Country
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
alpha = 0.6,
ax = ax1 # Give it ax1 as a plot area
)
# Directly set the axes y-axis label
ax1.set_ylabel("Total chemical oxygen demand")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
sns.violinplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log", # Set the x and y-axis
hue = "Country",
ax = ax2 # Give it ax2 as a plot area
)
# Add the swarmplot to the current axes
sns.swarmplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country", dodge = True, # colour and split the points by Country
color = "black", alpha = 0.9, # Recolour all of the points to black
ax = ..., # Give it ax2 as a plot area
size = ... # scale down the size of the points
)
# Directly set the ax2 y-axis label
ax2.set_ylabel("Operational taxonomic units (log 10)")
# Show the plot
plt.show()
Axes methods

So you can see there's a slight issue above with our legend in the violin plot. Due to size constraints, it's dropped right into the bottom-middle of the plot. It doesn't hold much meaning for us here, so we will remove it with the get_legend().remove() command. This accesses the Axes object's legend and removes it for us.
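The same removal can be seen in isolation with plain matplotlib (the one-line plot below is made up just to give the Axes a legend to remove):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="series A")
ax.legend()  # the Axes now owns a Legend object

# Fetch the legend from the Axes and remove it
ax.get_legend().remove()
print(ax.get_legend())  # None - the legend is gone
plt.close(fig)
```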
# Initialize a figure
fig = plt.figure(figsize=(10, 8))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 2), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 2), (0, 1), colspan=1)
ax3 = plt.subplot2grid((2, 2), (1, 1), colspan=1)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
sns.scatterplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country", # Set the colouring based on Country
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
alpha = 0.6,
ax = ax1 # Give it ax1 as a plot area
)
# Directly set the axes y-axis label
ax1.set_ylabel("Total chemical oxygen demand")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
sns.violinplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log", # Set the x and y-axis
hue = "Country",
ax = ax2 # Give it ax2 as a plot area
)
# Add the swarmplot to the current axes
sns.swarmplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country", dodge = True, # colour and split the points by Country
color = "black", alpha = 0.9, # Recolour all of the points to black
ax = ax2, # Give it ax2 as a plot area
size = 3 # scale down the size of the points
)
# Directly set the ax2 y-axis label
ax2.set_ylabel("Operational taxonomic units (log 10)")
# Remove the legend from this subplot
ax2...
# Show the plot
plt.show()
Let's complete the set by adding a KDE plot to our final panel. We'll filter our dataset on the fly as we pass it to the kdeplot() function.
# Initialize a figure
fig = plt.figure(figsize=(10, 8))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 2), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 2), (0, 1), colspan=1)
ax3 = plt.subplot2grid((2, 2), (1, 1), colspan=1)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
sns.scatterplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country", # Set the colouring based on Country
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
alpha = 0.6,
ax = ax1 # Give it ax1 as a plot area
)
# Directly set the axes y-axis label
ax1.set_ylabel("Total chemical oxygen demand")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
sns.violinplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log", # Set the x and y-axis
hue = "Country",
ax = ax2 # Give it ax2 as a plot area
)
# Add the swarmplot to the current axes
sns.swarmplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country", dodge = True, # colour and split the points by Country
color = "black", alpha = 0.9, # Recolour all of the points to black
ax = ax2, # Give it ax2 as a plot area
size = 3 # scale down the size of the points
)
# Directly set the ax2 y-axis label
ax2.set_ylabel("Operational taxonomic units (log 10)")
# Remove the legend from this subplot
ax2.get_legend().remove()
# ---------- Plot 3 ----------#
# We'll build a KDE plot
snsPlot = sns.kdeplot(data = latrine_OTU_data[(latrine_OTU_data["OTUs"]>2) &
(latrine_OTU_data["OTUs"]<1000)], # Filter the data
x = "OTUs",
hue = "Country",
fill = True,
ax = ... # Give it ax3 as a plot area
)
# Show the plot
plt.show()
Axes using tight_layout()

Nearly there! We can see there are some overlapping text issues where the y-axis labels of the KDE plot run into the scatterplot beside it. This is due to the size of the y-tick text pushing the label too far left. We can ask the plot to fix these spacing issues using the plt.tight_layout() method, which will resolve the overlap as best it can.
# Initialize a figure
fig = plt.figure(figsize=(10, 8))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 2), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 2), (0, 1), colspan=1)
ax3 = plt.subplot2grid((2, 2), (1, 1), colspan=1)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
sns.scatterplot(data = pitlatrine_data,
x = "TS", y = "CODt",
hue = "Country", # Set the colouring based on Country
style = "Country", # Set the point style based on Country
edgecolor = "black", # Set the edge colour of our points
alpha = 0.6,
ax = ax1 # Give it ax1 as a plot area
)
# Directly set the axes y-axis label
ax1.set_ylabel("Total chemical oxygen demand")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
sns.violinplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log", # Set the x and y-axis
hue = "Country",
ax = ax2 # Give it ax2 as a plot area
)
# Add the swarmplot to the current axes
sns.swarmplot(data = latrine_filtered,
x = "Taxa", y = "OTU_log",
hue = "Country", dodge = True, # colour and split the points by Country
color = "black", alpha = 0.9, # Recolour all of the points to black
ax = ax2, # Give it ax2 as a plot area
size = 3 # scale down the size of the points
)
# Directly set the ax2 y-axis label
ax2.set_ylabel("Operational taxonomic units (log 10)")
# Remove the legend from this subplot
ax2.get_legend().remove()
# ---------- Plot 3 ----------#
# We'll build a KDE plot
snsPlot = sns.kdeplot(data = latrine_OTU_data[(latrine_OTU_data["OTUs"]>2) &
(latrine_OTU_data["OTUs"]<1000)], # Filter the data
x = "OTUs",
hue = "Country",
fill = True,
ax = ax3 # Give it ax3 as a plot area
)
# Re-adjust the axes to remove overlap due to axis text
plt...
# Show the plot
plt.show()
Not too shabby! And we're done!
That's our fourth class on Python! You've made it through and we've learned about taking advantage of built-in DataFrame methods for exploratory data analysis as well as how to finally visualize some of your data:

groupby() and summarizing.
The matplotlib.pyplot module.
The seaborn package.

At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade, but a bonus 0.7% will also be awarded for submissions made within 24 hours of the end of lecture (i.e. 17:00 the following day).
Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete the Introduction to Data Visualization with Seaborn course (4 chapters, 3700 possible points). This is a pass-fail assignment, and in order to pass you need to achieve at least 2775 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:
| A sample screen shot for one of the DataCamp assignments. You'll want to combine yours into single images or PDFs if possible |
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, February 10th to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.